From the introduction page we know that global warming and the greenhouse effect have become pressing concerns in recent years. The greenhouse effect is driven by global radiative forcing, as explained on the introduction page (Chandler, Explained: Radiative forcing). Radiative forcing, in turn, is driven by the increasing concentration of greenhouse gases such as carbon dioxide, methane, nitrous oxide, and fluorinated gases. Carbon dioxide is the predominant greenhouse gas (National Geographic, Greenhouse effect).
The pie charts from EIA show that U.S. carbon dioxide emissions come from three sources: petroleum (46%), natural gas (34%), and coal (21%). These energy sources account for 66% of total U.S. CO2 emissions (U.S. Energy Information Administration - EIA - independent statistics and analysis) and for 73.2% of total greenhouse gas emissions (Ritchie et al., Emissions by sector). From the pie chart we also know that 21% of U.S. energy consumption is nonfossil, namely nuclear and renewable energy (U.S. Energy Information Administration - EIA - independent statistics and analysis).
One of the project goals is to find out whether, as total U.S. energy consumption shifts from traditional sources such as petroleum, natural gas, and coal toward cleaner sources like nuclear and renewable energy, total U.S. CO2 emissions decrease, and whether future CO2 emissions can be predicted from energy-source variables. The total monthly petroleum, natural gas, coal, and renewable energy consumption datasets were collected and shown in the previous tabs; they serve as explanatory variables for CO2 emissions in the time series models below.
ARMAX/ARIMAX/SARIMAX
The names ARMAX and ARIMAX are extensions of ARMA and ARIMA, respectively. The X stands for "exogenous": the model adds an outside variable to help explain the endogenous variable.
ARMAX - an AutoRegressive Moving Average with eXogenous terms model. Unlike the autoregressive with exogenous terms (ARX) model, the system structure of an ARMAX model includes the stochastic dynamics of the disturbance. ARMAX models are useful when dominating disturbances enter early in the process, such as at the input; for example, a wind gust affecting an aircraft is a dominating disturbance early in the process. This gives the ARMAX model more flexibility than the ARX model in handling disturbances.
ARIMAX - An Autoregressive Integrated Moving Average with Explanatory Variable (ARIMAX) model.
SARIMAX - Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors model
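To make the exogenous term concrete, here is a minimal simulation sketch of an ARMAX(1,1)-style process (illustrative Python rather than the R used elsewhere in this project; all names and parameter values are made up for the example): the response is a linear effect of the input plus an ARMA(1,1) disturbance.

```python
import random

def simulate_armax(x, phi=0.6, theta=0.3, beta=0.5, sigma=1.0, seed=42):
    """Simulate y_t = beta * x_t + n_t, where the disturbance n_t follows
    an ARMA(1,1): n_t = phi * n_{t-1} + e_t + theta * e_{t-1}."""
    rng = random.Random(seed)
    n_prev, e_prev = 0.0, 0.0
    y = []
    for xt in x:
        e = rng.gauss(0.0, sigma)
        n = phi * n_prev + e + theta * e_prev  # stochastic dynamics (ARMA part)
        y.append(beta * xt + n)                # exogenous input enters linearly
        n_prev, e_prev = n, e
    return y

# A step in the exogenous input shifts the level of y by about beta * step
x = [0.0] * 50 + [10.0] * 50
y = simulate_armax(x)
```

With `sigma = 0` the disturbance vanishes and `y` reduces to `beta * x`, which is a handy sanity check on the exogenous part of the model.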
```r
dd.ts <- ts(dd, start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")), frequency = 12)
autoplotly(dd.ts[, c(2:6)], facets = TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Variables influencing CO2 Emissions in USA")
```
As stated in the literature review, we are interested in the impacts of major U.S. energy sources on total U.S. CO2 emissions. The energy data were collected and shown in the data sources tab, and we built models and predictions from them in the ARMA/ARIMA/SARIMA Models tab. CO2 emissions are measured in million metric tons of carbon dioxide; renewable energy, petroleum, natural gas, and coal are measured in trillion Btu. The time series runs from January 1973 through December 2022.
Fitting SARIMAX model using auto.arima()
We use the auto.arima() function to fit the model, predicting CO2 Emissions with Renewable Energy, Petroleum, Natural Gas, and Coal as the exogenous variables. All variables are time series.
Series: dd.ts[, "CO2 Emissions"]
Regression with ARIMA(4,0,0)(0,1,1)[12] errors
Coefficients:
ar1 ar2 ar3 ar4 sma1 Rnwbl Ptlm Ntlgs Cl
0.7384 0.0817 0.0343 0.1269 -0.7452 -0.0017 0.0728 0.1218 -0.0065
s.e. 0.0417 0.0510 0.0512 0.0412 0.0287 0.0166 0.0045 0.0040 0.0069
sigma^2 = 85.53: log likelihood = -2143.09
AIC=4306.19 AICc=4306.57 BIC=4349.95
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.1968328 9.084863 6.881898 -0.03815555 1.18767 0.3113222
ACF1
Training set -0.003652219
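The training-set error measures reported by the summary are standard formulas; a minimal sketch of RMSE, MAE, and MAPE (illustrative Python with made-up numbers, not the model's actual residuals):

```python
import math

def error_measures(actual, fitted):
    """Return (RMSE, MAE, MAPE) for an actual series and its fitted values."""
    errs = [a - f for a, f in zip(actual, fitted)]
    n = len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / n)                     # root mean squared error
    mae = sum(abs(e) for e in errs) / n                                # mean absolute error
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errs, actual)) / n   # mean absolute percentage error
    return rmse, mae, mape

rmse, mae, mape = error_measures([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
```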
```r
checkresiduals(fit)
```
Ljung-Box test
data: Residuals from Regression with ARIMA(4,0,0)(0,1,1)[12] errors
Q* = 33.387, df = 19, p-value = 0.02168
Model df: 5. Total lags used: 24
From the results above, the fitted model is a regression with SARIMA(4,0,0)(0,1,1)[12] errors. Note that the Ljung-Box p-value (0.022) is below 0.05, so some autocorrelation remains in the residuals.
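In equation form, the fitted model is a linear regression whose error term follows the seasonal ARIMA process found above:

$$
y_t = \beta_1\,\mathrm{Rnwbl}_t + \beta_2\,\mathrm{Ptlm}_t + \beta_3\,\mathrm{Ntlgs}_t + \beta_4\,\mathrm{Cl}_t + \eta_t,
$$

$$
(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3 - \phi_4 B^4)(1 - B^{12})\,\eta_t = (1 + \Theta_1 B^{12})\,\varepsilon_t,
$$

where $B$ is the backshift operator, the $\phi_i$ are the ar1-ar4 coefficients, $\Theta_1$ is the seasonal MA coefficient (sma1 = -0.7452), and $\varepsilon_t$ is white noise.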
Fitting the model manually - Regression with ARMA Errors
Here we first have to fit the linear regression model predicting CO2 Emissions using Renewable Energy, Petroleum, Natural gas, and Coal.
Then for the residuals, we will fit an ARIMA/SARIMA model.
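The two-step procedure can be sketched in miniature (illustrative Python, not the project's R code): ordinary least squares for the regression step, then a Yule-Walker-style AR(1) estimate on the residuals as a stand-in for the full SARIMA fit. The data here are made up.

```python
def ols_simple(x, y):
    """Least-squares fit of y = a + b * x (single predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def ar1_coef(res):
    """Yule-Walker estimate of an AR(1) coefficient: the lag-1 autocorrelation."""
    m = sum(res) / len(res)
    c0 = sum((r - m) ** 2 for r in res)
    c1 = sum((res[t] - m) * (res[t - 1] - m) for t in range(1, len(res)))
    return c1 / c0

# Step 1: regression; Step 2: a time-series model on the regression residuals
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
a, b = ols_simple(x, y)
res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
phi = ar1_coef(res)
```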
First fit the linear model
```r
dd$`CO2 Emissions` <- ts(dd$`CO2 Emissions`,
                         start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                         frequency = 12)
dd$Renewable <- ts(dd$Renewable,
                   start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                   frequency = 12)
dd$Petroleum <- ts(dd$Petroleum,
                   start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                   frequency = 12)
dd$`Natural Gas` <- ts(dd$`Natural Gas`,
                       start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                       frequency = 12)
dd$Coal <- ts(dd$Coal,
              start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
              frequency = 12)

# First fit the linear model
fit.reg <- lm(`CO2 Emissions` ~ Renewable + Petroleum + `Natural Gas` + Coal, data = dd)
summary(fit.reg)
```
Call:
lm(formula = `CO2 Emissions` ~ Renewable + Petroleum + `Natural Gas` +
Coal, data = dd)
Residuals:
Min 1Q Median 3Q Max
-107.596 -27.274 -1.138 24.142 126.391
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.015e+02 2.092e+01 -9.630 < 2e-16 ***
Renewable -1.098e-01 1.316e-02 -8.348 4.88e-16 ***
Petroleum 1.957e-01 6.690e-03 29.252 < 2e-16 ***
`Natural Gas` 5.652e-02 4.124e-03 13.705 < 2e-16 ***
Coal 1.088e-01 4.877e-03 22.308 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.52 on 595 degrees of freedom
Multiple R-squared: 0.7462, Adjusted R-squared: 0.7445
F-statistic: 437.3 on 4 and 595 DF, p-value: < 2.2e-16
Warning in adf.test(diff(diff(res.fit), lag = 12)): p-value smaller than
printed p-value
Augmented Dickey-Fuller Test
data: diff(diff(res.fit), lag = 12)
Dickey-Fuller = -10.178, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary
```r
diff.res.fit <- res.fit |> diff() |> diff(lag = 12)
isSeasonal(diff.res.fit, test = "combined", freq = NA)
```
[1] FALSE
After first differencing and seasonal differencing, the ADF test indicates the series is stationary and the seasonality test detects no remaining seasonality. Now let's find the model parameters for the regression-residual series.
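What the `diff() |> diff(lag = 12)` pipeline does is plain repeated subtraction; the sketch below (illustrative Python on a made-up trend-plus-seasonal series) shows that one regular difference removes a linear trend and one lag-12 difference removes a fixed monthly pattern.

```python
def diff(x, lag=1):
    """Difference a series: x_t - x_{t-lag}, dropping the first `lag` points."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

# Linear trend (0.5 per month) plus a repeating 12-month seasonal pattern
pattern = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
series = [0.5 * t + pattern[t % 12] for t in range(48)]

stationary = diff(diff(series), lag=12)  # both components cancel exactly
```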
Find the Model Parameters
```r
# Grid over p = 0..2, q = 0..2, P = 0..2, Q = 0..2 with d = D = 1, s = 12,
# recording AIC, BIC and AICc for each candidate SARIMA model
SARIMA.c <- function(p1, p2, q1, q2, P1, P2, Q1, Q2, data) {
  d <- 1; D <- 1; s <- 12
  i <- 1
  ls <- matrix(rep(NA, 9 * 200), nrow = 200)
  for (p in p1:p2) {
    for (q in q1:q2) {
      for (P in P1:P2) {
        for (Q in Q1:Q2) {
          if (p + q + P + D + Q <= 8) {
            model <- Arima(data, order = c(p - 1, d, q - 1), seasonal = c(P - 1, D, Q - 1))
            ls[i, ] <- c(p - 1, d, q - 1, P - 1, D, Q - 1, model$aic, model$bic, model$aicc)
            i <- i + 1
          }
        }
      }
    }
  }
  temp <- as.data.frame(ls)
  names(temp) <- c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
  temp
}
```
The best model from the grid-search output is SARIMA(1,1,1)(0,1,1)[12], while auto.arima() suggested SARIMA(2,0,1)(2,1,1)[12].
Model diagnostics
From model fitting we obtained one model, SARIMA(1,1,1)(0,1,1)[12], while auto.arima() produced SARIMA(2,0,1)(2,1,1)[12]. SARIMA(1,1,1)(0,1,1)[12] has the lower AIC, BIC, and AICc. Now let's run diagnostics on both models to analyze the results and find the better one for forecasting.
SARIMA(2,0,1)(2,1,1)[12]
```r
set.seed(1234)
fit1 <- Arima(res.fit, order = c(2, 0, 1), seasonal = list(order = c(2, 1, 1), period = 12))
```

```r
res.fit |>
  Arima(order = c(2, 0, 1), seasonal = list(order = c(2, 1, 1), period = 12)) |>
  residuals() |>
  ggtsdisplay()
```
```r
checkresiduals(fit1)
```
Ljung-Box test
data: Residuals from ARIMA(2,0,1)(2,1,1)[12]
Q* = 36.154, df = 18, p-value = 0.006743
Model df: 6. Total lags used: 24
The Ljung-Box test uses the following hypotheses:
H0: The residuals are independently distributed.
HA: The residuals are not independently distributed; they exhibit serial correlation.
Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater than 0.05 because this means the residuals for our time series model are independent, which is often an assumption we make when creating a model.
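The Q* statistic in the output above can be computed directly from the residual autocorrelations; here is a sketch of the Ljung-Box statistic (illustrative Python, not the R implementation). The resulting Q is compared against a chi-square quantile whose degrees of freedom are the number of lags minus the model degrees of freedom (24 - 6 = 18 in the test above).

```python
def ljung_box_Q(res, h=24):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n-k),
    where r_k is the lag-k sample autocorrelation of the residuals."""
    n = len(res)
    m = sum(res) / n
    c0 = sum((r - m) ** 2 for r in res)
    total = 0.0
    for k in range(1, h + 1):
        r_k = sum((res[t] - m) * (res[t - k] - m) for t in range(k, n)) / c0
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total

# Strongly autocorrelated residuals produce a large Q (clear rejection of H0);
# the statistic is always non-negative.
Q = ljung_box_Q([1.0, -1.0] * 50, h=5)
```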
There are two significant spikes in the ACF, and with p = 0.0067 < 0.05 the model fails the Ljung-Box test: the residuals still show serial correlation.
Series: res.fit
ARIMA(1,1,1)(0,1,1)[12]
Coefficients:
ar1 ma1 sma1
0.4485 -0.8254 -0.8340
s.e. 0.0637 0.0409 0.0232
sigma^2 = 229.8: log likelihood = -2434.7
AIC=4877.4 AICc=4877.47 BIC=4894.9
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.2983109 14.95631 11.52741 -17.96732 111.7469 0.6125249
ACF1
Training set -0.01297584
Model Selection
From the two model diagnostics above, the SARIMA(1,1,1)(0,1,1)[12] and SARIMA(2,0,1)(2,1,1)[12] models show a similar number of spikes in the ACF and PACF plots of their residuals, and their training-set error measures are similar. The SARIMA(1,1,1)(0,1,1)[12] model has a slightly smaller sigma squared, meaning a smaller residual variance, and an estimator with smaller variance is more efficient.
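The reported information criteria can be cross-checked by hand from the log likelihood. For SARIMA(1,1,1)(0,1,1)[12], logLik = -2434.7 with k = 4 estimated parameters (ar1, ma1, sma1, and sigma^2), so AIC = 2k - 2*logLik = 4877.4, matching the summary. A quick sketch (illustrative Python; the effective sample size 587 = 600 - 1 - 12 after differencing is an assumption, chosen to be consistent with the reported AICc):

```python
def aic(loglik, k):
    """Akaike information criterion: AIC = 2k - 2 * logLik."""
    return 2 * k - 2 * loglik

def aicc(loglik, k, n):
    """Small-sample corrected AIC: AICc = AIC + 2k(k+1) / (n - k - 1)."""
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

# Values from the SARIMA(1,1,1)(0,1,1)[12] summary above
aic_val = aic(-2434.7, 4)         # 4877.4, as reported
aicc_val = aicc(-2434.7, 4, 587)  # approximately 4877.47, as reported
```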
# Modeling U.S. Total CO2 Emissions Data - ARIMAX

The code below walks through the full pipeline: loading the consumption and emissions data, fitting the models above, cross-validating, and forecasting.

## Plotting the Data

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(forecast)
library(astsa)
library(xts)
library(tseries)
library(fpp2)
library(fma)
library(lubridate)
library(TSstudio)
library(quantmod)
library(tidyquant)
library(plotly)
library(ggfortify)
library(autoplotly)
library(ggpubr)
library(seastests)
theme_set(theme_gray(base_size = 12, base_family = "Palatino"))
```

```{r}
emissions_df  <- read.csv("data/df_total_monthly_CO2_emissions.csv", skip = 1, row.names = 1)
renewable_df  <- read.csv("data/df_total_monthly_renewable_consumption.csv", skip = 1, row.names = 1)
naturalgas_df <- read.csv("data/df_total_monthly_natural_gas_consumption.csv", skip = 1, row.names = 1)
coal_df       <- read.csv("data/df_total_monthly_coal_consumption.csv", row.names = 1)
petroleum_df  <- read.csv("data/df_total_monthly_df_monthly_petroleum_consumption.csv", skip = 1, row.names = 1)

dd <- data.frame(emissions_df$X.1, emissions_df$sum, renewable_df$sum,
                 petroleum_df$sum, naturalgas_df$sum, coal_df$value)
colnames(dd) <- c("DATE", "CO2 Emissions", "Renewable", "Petroleum", "Natural Gas", "Coal")
knitr::kable(head(dd))
```

```{r}
dd.ts <- ts(dd, start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")), frequency = 12)
autoplotly(dd.ts[, c(2:6)], facets = TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Variables influencing CO2 Emissions in USA")
```

## Fitting SARIMAX model using `auto.arima()`

```{r}
# "CO2 Emissions", "Renewable", "Petroleum", "Natural Gas", "Coal"
xreg <- cbind(Rnwbl = dd.ts[, "Renewable"],
              Ptlm  = dd.ts[, "Petroleum"],
              Ntlgs = dd.ts[, "Natural Gas"],
              Cl    = dd.ts[, "Coal"])
fit <- auto.arima(dd.ts[, "CO2 Emissions"], xreg = xreg, seasonal = TRUE)
summary(fit)
```

```{r}
checkresiduals(fit)
```

## Fitting the model manually - Regression with ARMA Errors

### First fit the linear model

```{r}
dd$`CO2 Emissions` <- ts(dd$`CO2 Emissions`,
                         start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                         frequency = 12)
dd$Renewable <- ts(dd$Renewable,
                   start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                   frequency = 12)
dd$Petroleum <- ts(dd$Petroleum,
                   start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                   frequency = 12)
dd$`Natural Gas` <- ts(dd$`Natural Gas`,
                       start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
                       frequency = 12)
dd$Coal <- ts(dd$Coal,
              start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
              frequency = 12)

# First fit the linear model
fit.reg <- lm(`CO2 Emissions` ~ Renewable + Petroleum + `Natural Gas` + Coal, data = dd)
summary(fit.reg)
```

### Linear Model Residuals

::: panel-tabset
## Original Time Series Residuals

```{r}
res.fit <- ts(residuals(fit.reg),
              start = decimal_date(as.Date("1973-01-01", format = "%Y-%m-%d")),
              frequency = 12)
ggtsdisplay(res.fit)
```

## First Diff

```{r}
res.fit |> diff() |> ggtsdisplay()
```

## First Diff and Seasonality Diff

```{r}
res.fit |> diff() |> diff(lag = 12) |> ggtsdisplay()
```
:::

### Stationarity and Seasonality Test

```{r}
res.fit |> diff() |> diff(lag = 12) |> adf.test()
```

```{r}
diff.res.fit <- res.fit |> diff() |> diff(lag = 12)
isSeasonal(diff.res.fit, test = "combined", freq = NA)
```

### Find the Model Parameters

```{r}
# Grid over p = 0..2, q = 0..2, P = 0..2, Q = 0..2 with d = D = 1, s = 12,
# recording AIC, BIC and AICc for each candidate SARIMA model
SARIMA.c <- function(p1, p2, q1, q2, P1, P2, Q1, Q2, data) {
  d <- 1; D <- 1; s <- 12
  i <- 1
  ls <- matrix(rep(NA, 9 * 200), nrow = 200)
  for (p in p1:p2) {
    for (q in q1:q2) {
      for (P in P1:P2) {
        for (Q in Q1:Q2) {
          if (p + q + P + D + Q <= 8) {
            model <- Arima(data, order = c(p - 1, d, q - 1), seasonal = c(P - 1, D, Q - 1))
            ls[i, ] <- c(p - 1, d, q - 1, P - 1, D, Q - 1, model$aic, model$bic, model$aicc)
            i <- i + 1
          }
        }
      }
    }
  }
  temp <- as.data.frame(ls)
  names(temp) <- c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
  temp
}
```

```{r, warning=FALSE}
# Search p, q, P, Q in 0..2 with d = D = 1
output <- SARIMA.c(p1 = 1, p2 = 3, q1 = 1, q2 = 3, P1 = 1, P2 = 3, Q1 = 1, Q2 = 3,
                   data = res.fit) |> drop_na()
minaic <- output[which.min(output$AIC), ]
minbic <- output[which.min(output$BIC), ]
```

::: panel-tabset
## Parameters with Minimum AIC

```{r}
knitr::kable(minaic)
```

## Parameters with Minimum BIC

```{r}
knitr::kable(minbic)
```

## Using `auto.arima()`

```{r}
auto.arima(res.fit)
```
:::

### Model diagnostics

#### SARIMA(2,0,1)(2,1,1)\[12\]

```{r}
set.seed(1234)
fit1 <- Arima(res.fit, order = c(2, 0, 1), seasonal = list(order = c(2, 1, 1), period = 12))
```

#### Model Fitting Visual Results and Residuals

```{r}
model_output <- capture.output(sarima(res.fit, 2, 0, 1, 2, 1, 1, 12))
```

```{r}
res.fit |>
  Arima(order = c(2, 0, 1), seasonal = list(order = c(2, 1, 1), period = 12)) |>
  residuals() |>
  ggtsdisplay()
```

```{r}
checkresiduals(fit1)
```

#### Model Fitting

```{r}
res.fit |> Arima(order = c(2, 0, 1), seasonal = list(order = c(2, 1, 1), period = 12))
```

#### Model Output Diagnostics

```{r}
cat(model_output[65:100], model_output[length(model_output)], sep = "\n")
```

```{r}
summary(fit1)
```

#### SARIMA(1,1,1)x(0,1,1)\[12\]

```{r}
fit2 <- Arima(res.fit, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
```

#### Model Fitting Visual Results and Residuals

```{r}
model_output2 <- capture.output(sarima(res.fit, 1, 1, 1, 0, 1, 1, 12))
```

```{r}
res.fit |>
  Arima(order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12)) |>
  residuals() |>
  ggtsdisplay()
```

```{r}
checkresiduals(fit2)
```

There is one significant spike in the ACF, and the model passed the Ljung-Box test.

#### Model Fitting

```{r}
res.fit |> Arima(order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
```

#### Model Output Diagnostics

```{r}
cat(model_output2[30:61], model_output2[length(model_output2)], sep = "\n")
```

```{r}
summary(fit2)
```

### Cross Validation

```{r, warning=FALSE}
n <- length(res.fit)
k <- 120
# n - k = 480; 480 / 12 = 40
rmse1 <- matrix(NA, 40, 12)
rmse2 <- matrix(NA, 40, 12)
rmse3 <- matrix(NA, 40, 12)
st <- tsp(res.fit)[1] + (k - 1) / 12

for (i in 1:10) {
  xtrain <- window(res.fit, end = st + i - 1)
  xtest  <- window(res.fit, start = st + (i - 1) + 1 / 12, end = st + i)

  fit  <- Arima(xtrain, order = c(2, 0, 1), seasonal = list(order = c(2, 1, 1), period = 12),
                include.drift = TRUE, method = "CSS")
  fcast  <- forecast(fit, h = 40)
  fit2 <- Arima(xtrain, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12),
                include.drift = TRUE, method = "CSS")
  fcast2 <- forecast(fit2, h = 40)
  fit3 <- Arima(xtrain, order = c(2, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12),
                include.drift = TRUE, method = "CSS")
  fcast3 <- forecast(fit3, h = 40)

  rmse1[i, 1:length(xtest)] <- sqrt((fcast$mean  - xtest)^2)
  rmse2[i, 1:length(xtest)] <- sqrt((fcast2$mean - xtest)^2)
  rmse3[i, 1:length(xtest)] <- sqrt((fcast3$mean - xtest)^2)
}

plot(1:12, colMeans(rmse1, na.rm = TRUE), type = "l", col = 2, xlab = "horizon", ylab = "RMSE")
lines(1:12, colMeans(rmse2, na.rm = TRUE), type = "l", col = 3)
lines(1:12, colMeans(rmse3, na.rm = TRUE), type = "l", col = 4)
legend("topleft", legend = c("fit1", "fit2", "fit3"), col = 2:4, lty = 1)
```

SARIMA(2,0,1)x(2,1,1)\[12\] gives a good fit based on its low RMSE.

### Model Fitting - SARIMA(2,0,1)x(2,1,1)\[12\]

```{r}
xreg <- cbind(RNWBL = dd[, "Renewable"],
              PTRLM = dd[, "Petroleum"],
              NTRLG = dd[, "Natural Gas"],
              CL    = dd[, "Coal"])
fit <- Arima(dd$`CO2 Emissions`, order = c(2, 0, 1), seasonal = c(2, 1, 1),
             xreg = xreg, method = "CSS")
summary(fit)
```

## Forecast - SARIMA(2,0,1)x(2,1,1)\[12\]

To forecast the `CO2 Emissions` variable, we first need forecasts of the exogenous variables `Renewable Energy`, `Petroleum`, `Natural Gas`, and `Coal`. Here we use `auto.arima()` to fit each of these series.

### Fitting SARIMA model to different energy consumption variables

::: panel-tabset
## Renewable Energy

```{r}
RNWBL_fit <- auto.arima(dd$Renewable)  # fitting an ARIMA model to the renewable energy variable
summary(RNWBL_fit)
```

```{r}
fRNWBL <- forecast(RNWBL_fit, h = 60)
```

## Petroleum

```{r}
PTRLM_fit <- auto.arima(dd$Petroleum)  # fitting an ARIMA model to the petroleum variable
summary(PTRLM_fit)
```

```{r}
fPTRLM <- forecast(PTRLM_fit, h = 60)
```

## Natural Gas

```{r}
NTRLG_fit <- auto.arima(dd$`Natural Gas`)  # fitting an ARIMA model to the natural gas variable
summary(NTRLG_fit)
```

```{r}
fNTRLG <- forecast(NTRLG_fit, h = 60)
```

## Coal

```{r}
CL_fit <- auto.arima(dd$Coal)  # fitting an ARIMA model to the coal variable
summary(CL_fit)
```

```{r}
fCL <- forecast(CL_fit, h = 60)
```
:::

```{r}
fxreg <- cbind(RNWBL = fRNWBL$mean,
               PTRLM = fPTRLM$mean,
               NTRLG = fNTRLG$mean,
               CL    = fCL$mean)
fcast <- forecast(fit, xreg = fxreg)  # fcast$mean gives the forecasted values
g <- autoplot(fcast) +
  xlab("Year") + ylab("Million Metric Tons of Carbon Dioxide") +
  ggtitle("CO2 Emissions Forecast from SARIMAX Model")
ggplotly(g)
```